Optimizing Query Processing in Batch Streaming System
Abstract
With the growing need to process “big data” in real time, modern stream processing systems must be able to operate at cloud scale. This poses two challenges for building large-scale stream processing systems. First, processing tasks must be distributed efficiently to worker nodes with little overhead. Second, stream processing must be highly available even though failures are common in datacenters. Spark Streaming [26] proposes the DStream model to cope with these problems. DStream stands for discretized stream: data in the incoming stream is divided into small batches for processing. Compared with processing data at the granularity of a single record, batch processing has much lower overhead and a cheaper fault tolerance model. Lineage information for each batch is kept so that the batch can be recomputed when a failure occurs. Fault tolerance can therefore be achieved without duplicating processing nodes. In this paper, we discuss how to optimize query processing in the DStream model. Specifically, we consider the case of Structured Query Language (SQL). SQL provides a declarative interface for users to query the data. The declarative nature of SQL creates an opportunity for query optimization, because the execution is decoupled from the semantics of the query. In a streaming system, the same query is executed on similar data over and over again. Hence, statistics about the data can be obtained for free, as long as the pattern of the incoming data does not change abruptly. We study the performance of applying query optimization techniques in the DStream model, and show the advantage of dynamically optimizing stream processing.
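As a rough illustration of the idea in the abstract (not the paper's implementation, and not the Spark Streaming API), the following Python sketch cuts a stream into small batches and uses statistics from the previous batch to reorder the predicates of a conjunctive filter for the next batch. The names `discretize`, `run_query`, and the selectivity-based ordering rule are assumptions made for this example; a real optimizer would consider far more than predicate order.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, int]
Predicate = Callable[[Record], bool]


def discretize(stream: Iterable, batch_size: int) -> Iterator[List]:
    """Cut an incoming stream into small batches (the discretized-stream idea)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_query(stream: Iterable[Record], predicates: Dict[str, Predicate], batch_size: int):
    """Evaluate a conjunctive filter per batch, re-planning between batches."""
    # Estimated fraction of records each predicate lets through.
    # 1.0 means "no information yet" (hypothetical default).
    selectivity = {name: 1.0 for name in predicates}
    results, plans = [], []
    for batch in discretize(stream, batch_size):
        # Plan: apply the predicate expected to pass the fewest records first,
        # so later predicates are evaluated on fewer records.
        order = sorted(predicates, key=lambda name: selectivity[name])
        plans.append(order)
        results.extend(r for r in batch if all(predicates[n](r) for n in order))
        # Refresh statistics for the next batch. This sketch recomputes them
        # explicitly; a real system would collect them as a by-product of execution.
        for name, pred in predicates.items():
            selectivity[name] = sum(1 for r in batch if pred(r)) / len(batch)
    return results, plans
```

The plan for a batch depends only on statistics from earlier batches, which is reasonable only under the abstract's assumption that the pattern of the incoming data does not change abruptly.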
Similar resources
Two Architectures for Parallel Processing of Huge Amounts of Text
This paper presents two alternative NLP architectures to analyze massive amounts of documents, using parallel processing. The two architectures focus on different processing scenarios, namely batch-processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The st...
Shared Query Processing in Data Streaming Systems
Shared Query Processing in Data Streaming Systems, by Saileshwar Krishnamurthy. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor Michael J. Franklin, Chair. In networked environments there is an increased proliferation of sources (e.g., seismic sensors, financial tickers) that produce live data streams. As a consequence, systems that can manage streaming data h...
Design and Test of the Real-time Text mining dashboard for Twitter
One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has become one of the concerns of data science scholar...
Optimizing Latency and Throughput Trade-offs in a Stream Processing System
The value of stream processing systems stems largely from the timeliness of the results these systems provide. Early stream processors followed the record-at-a-time approach, servicing each data point as soon as it arrives at the system. While these systems provide good latency, their behaviors become less desirable when applications require high throughput, fault tolerance, or usage of statefu...
Customer Order Scheduling with Job-Based Processing and Lot Streaming In A Two-Machine Flow Shop
This paper considers a customer order scheduling (COS) problem in which each customer requests a variety of products processed in a two-machine flow shop. A sequence-independent attached setup for each machine is needed before processing each product lot. We assume that customer orders are satisfied by the job-based processing approach in which the same products from different customer orders f...